CRISPR Detection from Short Reads Using Partial Overlap Graph, by Ilan Ben-Bassat and Benny Chor

Authors

  • Ilan Ben-Bassat
  • Benny Chor
Abstract

Notations: Let n be the number of ℓ-long reads in a dataset sampled from a bacterial genome G with coverage c. Let K be the number of frequent k-mers in the dataset that are not part of a CRISPR repeat (also referred to as irrelevant frequent k-mers). For every frequent k-mer u in G, let F_u be the number of reads containing u, and let F be the total number of reads containing some frequent k-mer. We denote by m the maximum number of spacers in a CRISPR array of G, and by p the number of CRISPR arrays in G. We also define several constants to be used in the analysis. Two of these express our current knowledge of the attributes of CRISPR loci: R is the maximum length of a CRISPR repeat, and S is the maximum length of a CRISPR spacer. In addition, we assume that k-mers from repeats are not frequent outside of CRISPR arrays. We also assume that sequences of at least σ bases (where σ is the minimum overlap length) that combine a spacer subsequence and a repeat subsequence are not frequent in the rest of the bacterial genome G. We therefore denote by I the upper bound on the number of times that a k-mer originating from a repeat, a spacer, or a consecutive combination of the two can appear in the genome outside of a CRISPR array. This bound also takes into account isolated copies of such a k-mer inside a CRISPR array (due to degenerate or truncated repeats). For simplicity, we assume that different CRISPR arrays share no common repeats.
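To make the quantities F_u and F concrete, here is a minimal Python sketch that computes them from a list of reads. It is an illustration only and not the authors' implementation. The function names and the `threshold` parameter are hypothetical, and the rule used for "frequent" (a k-mer whose total occurrence count across all reads is at least `threshold`) is an assumption, since the text above does not define frequency.

```python
from collections import Counter


def distinct_kmers(read, k):
    """Return the set of distinct k-mers occurring in a single read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def frequent_kmer_stats(reads, k, threshold):
    """Compute the frequent k-mers, F_u for each of them, and F.

    Assumption (not from the source): a k-mer is "frequent" when it
    occurs at least `threshold` times in total across all reads.
    """
    # Count every occurrence of every k-mer, including repeats in one read.
    occurrences = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            occurrences[read[i:i + k]] += 1

    frequent = {u for u, count in occurrences.items() if count >= threshold}

    f_u = Counter()  # F_u: number of reads containing the frequent k-mer u
    f_total = 0      # F: number of reads containing some frequent k-mer
    for read in reads:
        hits = distinct_kmers(read, k) & frequent
        for u in hits:
            f_u[u] += 1
        if hits:
            f_total += 1

    return frequent, dict(f_u), f_total


if __name__ == "__main__":
    reads = ["ACGTA", "CGTAC", "GGGGG", "AAACC"]
    frequent, f_u, f_total = frequent_kmer_stats(reads, k=3, threshold=2)
    print(sorted(frequent), f_u, f_total)
```

Note that F_u counts reads rather than occurrences, so a read that contains the same frequent k-mer several times contributes only once to F_u and once to F.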

Similar resources

CRISPR Detection from Short Reads Using Partial Overlap Graphs

Clustered regularly interspaced short palindromic repeats (CRISPR) are structured regions in bacterial and archaeal genomes, which are part of an adaptive immune system against phages. CRISPRs are important for many microbial studies and are playing an essential role in current gene editing techniques. As such, they attract substantial research interest. The exponential growth in the amount of ...

String graph construction using incremental hashing

MOTIVATION New sequencing technologies generate larger amount of short reads data at decreasing cost. De novo sequence assembly is the problem of combining these reads back to the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences. Most existing approaches to the assembly pr...

HINGE: long-read assembly achieves optimal repeat resolution.

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution...

Information-optimal genome assembly via sparse read-overlap graphs

MOTIVATION In the context of third-generation long-read sequencing technologies, read-overlap-based approaches are expected to play a central role in the assembly step. A fundamental challenge in assembling from a read-overlap graph is that the true sequence corresponds to a Hamiltonian path on the graph, and, under most formulations, the assembly problem becomes NP-hard, restricting practical ...

Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly

Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short errone...


Publication date: 2015